Basic Plots

Reference:

Several ways to produce plots are by using:

  • base + graphics packages (which come with R installation)

  • ggplot2

  • plotly

  • lattice

It seems that lattice is dying out, while plotly is picking up more popularity, as a language agnostic sofware that can be used in several languages.

Scatter Plots

data(iris)
x = iris$Petal.Length
y = iris$Sepal.Length

plot(x,y,pch=16, xlab='Petal Length',
                 ylab='Sepal Length',
                 main='Iris Sepal Length VS Petal Length',
                  col=iris$Species)

legend("topleft",legend=unique(iris$Species),
       fill=unique(iris$Species),bty='n')

As mentioned earlier, there are various cheatsheets and websites with R base graphs (for example, this one).

type =

  • “p” for points

  • “l” for lines

  • “b” for both

  • “c” for the lines part alone of “b”,

  • “o” for both ‘overplotted’,

  • “h” for ‘histogram’ like (or ‘high-density’) vertical lines,

  • “s” for stair steps,

  • “S” for other steps, see ‘Details’ below,

  • “n” for no plotting.

Unfortunately, plot() does not allow putting another graph on the top of previousely generated. Instead, we use function lines() for adding a curve, or abline() for adding a straight line, or points() for adding scatter plot.

x = seq(-10,10,by=0.1)
y1 = sin(x)
y2 = cos(x)
plot(x,y1, type="l", col="blue")
lines(x,y2,col="red",lty=2)
abline(a=0,b=0.1,col="green")
abline(v=5,col="magenta",lwd=4,lty=3)
points(x=seq(-10,10,by=1), y=exp(seq(-10,10,by=1)), col="cyan", pch=3) #pch=3 means +

In abline, \(y=a + b x\), so a = intercept; b = slope; also, h = horizontal; v = vertical

Histograms

We use mtcars from the built-in package datasets.

head(mtcars, 4)
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

hist(mtcars$mpg)

hist(mtcars$mpg, breaks=12, col="red")

Normal distribution; sample size \(n=100\)

set.seed(12)
x = rnorm(n=100, mean=0, sd=1)
hist(x, breaks=50, xlim=c(-4,4))

sample size \(n=500\)

x = rnorm(n=500, mean=0, sd=1)
hist(x, breaks=50, xlim=c(-4,4))

normalizing (scaling)

#x is from previous slide
hist(x, breaks=50, freq=F, xlim=c(-4,4))

sample size \(n=1000\)

x = rnorm(n=1000, mean=0, sd=1)
hist(x, breaks=50, freq=F, xlim=c(-4,4))

png("four-hists.png", width=1000, height=500) #open a new png file
t=(-400:400)/100; set.seed(12)
par(mfrow=c(2,2))   
for(n in c(100, 500, 1000, 5000)){
    x = rnorm(n=n, mean=0, sd=1)
    hist(x, breaks=50, freq=F,main=paste("n=",n),
         xlim=c(-4,4), ylim=c(0,0.8))
    lines(t,dnorm(t,0,1),lwd=2,col="blue")
} #end of the for-loop
dev.off()  #close the current (i.e. png) file 

Density of \(X\sim\mathcal{N}(\mu; \sigma^2)\)

\[f(x)= {1\over \sqrt{2\pi\sigma^2}}\,e^{-{(x-\mu)^2\over 2\sigma^2}}\]

For \(X\sim\mathcal{N}(\mu=0;\,\sigma^2=1),\)

\[f(x)= {1\over \sqrt{2\pi}}\,e^{-{x^2\over 2\sigma^2}}\]

\(X\sim \mathcal{B}in(10; p=0.1)\)

## n=sample size;  size=# trials
x = rbinom(n=50, size=10, prob=0.1) 
hist(x, freq=F, breaks=seq(from=-0.5,to=10.5,by=1))

\(X\sim \mathcal{B}in(10; p=0.5)\)

## n=sample size;  size=# trials
x = rbinom(n=50, size=10, prob=0.5) 
hist(x, freq=F, breaks=seq(from=-0.5,to=10.5,by=1))

Expected Value (Mean) and Sample Mean

  • Discrete case: \(R_X=\{x_1,x_2,x_3,\dots\}\)

\[E[X]=\mu_X = \sum_i x_i \cdot P(X=x_i)\]


  • Continuous case \(X\sim f\) - density

\[E[X] = \mu_X = \int_{-\infty}^{\infty}x\cdot f(x)\,dx\]

  • sample mean (in either case):

\[\displaystyle \bar{X} = {1\over n}\sum_{i=1}^n X_i\]

Variance and Standard Deviation

  • Discrete case: \(R_X=\{x_1,x_2,x_3,\dots\}\)

\[\text{var}[X]=\sigma^2_X = \sum_i (x_i-\mu_X)^2 \cdot P(X=x_i)\]

  • Continuous case \(X\sim f\) - density

\[ \text{var}[X] = \sigma^2_X = \int_{-\infty}^{\infty}(x-\mu_X)^2\cdot f(x)\,dx\]

  • standard deviation (either case): \(\sigma_X = \sqrt{\sigma^2_X}\)
  • sample variance (either case): \(\displaystyle S^2 = {1\over n-1}\sum_{i=1}^n (X_i-\bar{X})^2\)

attach(mtcars)

mean(mpg)
[1] 20.09062
var(mpg)
[1] 36.3241
sd(mpg)
[1] 6.026948

For numerical data that come from continuous random variable, we can estimate density based on the histogram. This is so called kernel density plot.

# Kernel Density Plot
dsty = density(mtcars$mpg, kernel="gaussian") # returns the Kernel Density Estimation (KDE)
str(dsty)
List of 7
 $ x        : num [1:512] 2.97 3.05 3.12 3.2 3.27 ...
 $ y        : num [1:512] 0.000114 0.000125 0.000137 0.00015 0.000164 ...
 $ bw       : num 2.48
 $ n        : int 32
 $ call     : language density.default(x = mtcars$mpg, kernel = "gaussian")
 $ data.name: chr "mtcars$mpg"
 $ has.na   : logi FALSE
 - attr(*, "class")= chr "density"

x = mtcars$mpg
hist(x, freq=FALSE, col="red", xlab="Miles Per Gallon",
   main="Histogram with Normal Curve")

# Kernel Density Plot
dsty = density(mtcars$mpg, kernel="gaussian") # returns the Kernel Density Estimation (KDE)

lines(dsty$x, dsty$y, type='l', lwd=2) # plots the results

Box Plots

data(airquality)
air = airquality
head(air)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
unique(air$Month)
[1] 5 6 7 8 9

attach(air)
par(mfrow=c(2,1), mai=c(0.5,0.6,0.5,0.6))
hist(Solar.R, xlim=c(0,350))
boxplot(Solar.R, horizontal=T, main="Boxplot of Solar.R", ylim=c(0,350), axes=F)
axis(1)

Median, 1st Quartile, 3rd Quartile

  • collected data: \(x_1, x_2, x_3, \dots , x_n\)

  • ordered data: \(x_{(1)}\leq x_{(2)} \leq \dots \leq x_{(n)}\)

  • median: \(\displaystyle M = \begin{cases} x_{({n+1\over 2})} &, \text{ if }n \text{ is odd} \\ {x_{({n\over 2})}+x_{({n\over 2}+1)}\over 2} &, \text{ if } n \text{ is even} \end{cases} \hspace{2cm} \boxed{1}\,\,\boxed{2}\,\,\boxed{3}\,\,\boxed{4}\,\,\boxed{5}\,\,\boxed{6}\,\,\boxed{7}\,\,\boxed{8}\,\,\boxed{9}\)

  • first quartile \((Q_1)\): median of the “lower” part of the (sorted) data, i.e. median of the part of the data less than \(M\).

  • second quartile: \(Q_2 = M\)

  • third quartle \((Q_3)\): median of the “upper” part of the (sorted) data, i.e. median of the part of the data greater than \(M\).

median(Solar.R)
[1] NA
median(Solar.R, na.rm=T)
[1] 205
quantile(Solar.R, probs=seq(0,1,by=0.25), na.rm=T)
    0%    25%    50%    75%   100% 
  7.00 115.75 205.00 258.75 334.00 
c(min(Solar.R,na.rm=T), max(Solar.R,na.rm=T))
[1]   7 334

summary(Solar.R)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    7.0   115.8   205.0   185.9   258.8   334.0       7 
summary(air)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               

library(dplyr)
months = recode_factor(air$Month, '5'="May", '6'="June", '7'="July",
                       '8'="August", '9'="September")
unique(months)
[1] May       June      July      August    September
Levels: May June July August September

Note that we used recode_factor() from dplyr package.

boxplot(Solar.R~months, data=air, col=topo.colors(length(unique(months))))

We can also make a horizontal box plot. For example, we use the same data as above but with horizontal=TRUE, and instead of topo.colors() we make our own choice of each color.

boxplot(Solar.R~months, data=air,
        horizontal=TRUE,
        col=c("purple", "blue", "green", "yellow", "red"))
legend("topright",legend=c('may','june','july','aug','sept'),
       fill=c("purple", "blue", "green", "yellow", "red"))

To add a curve/line/trace on the top of already existing plot, we need one of the “annotating” functions. We use abline() here to add horizontal line through the mean of Solar.R.

boxplot(Solar.R~months, data=air, col=topo.colors(length(unique(months))))
abline(h=mean(air$Solar.R,na.rm=TRUE), col='red', lty='dashed', lwd=2)

Bar Charts

data(mtcars)
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
counts = table(mtcars$cyl)
counts
 4  6  8 
11  7 14 

counts = as.vector(counts)
xx <- barplot(counts, main="Car Distribution by Cylinders",
   xlab="Number of Cylinders",
   ylim = c(0,max(counts)+3), 
   col="deeppink")
text(x=xx, y=counts+1, label=as.character(counts), 
     cex=1.3, col="darkblue")

Stacked Bar Plot with Colors and Legend

counts = table(mtcars$vs, mtcars$cyl)
barplot(counts, main="Car Distribution by Cylinders and VS (v-shaped engine)",
  xlab="Number of Cylinders", col=c("darkblue", "deeppink"),
  xlim = c(0,4.5),
  legend = rownames(counts))

By-group Bar Plot

counts = table(mtcars$vs, mtcars$cyl)
barplot(counts, main="Car Distribution by Cylinders and VS",
  xlab="Number of Cylinders", col=c("darkblue","deeppink"),
  legend = rownames(counts), beside=TRUE)

Pie Charts

counts = as.vector(table(mtcars$cyl))
lbls <- c("4","6","8")
pct <- round(counts/sum(counts)*100)
lbls <- paste(lbls, pct, sep="; ") # add percents to labels
lbls <- paste(lbls,"%",sep="") # ad % to labels
pie(counts,labels = lbls, col=rainbow(length(lbls)),
   main="Pie Chart of Cylinders in the Data")

A fancier, 3D exploded pie can be obtained using the package plotrix (don’t forget to install first)

##install.packages("plotrix")
library(plotrix)
counts = as.vector(table(mtcars$cyl))
lbls <- c("4","6","8")
pct <- round(counts/sum(counts)*100)
lbls <- paste(lbls, pct, sep="; ") # add percents to labels
lbls <- paste(lbls,"%",sep="") # ad % to labels
pie3D(counts,labels = lbls, col=rainbow(length(lbls)), explode=0.1,
   main="Pie Chart of Cylinders in the Data")

Colors in Base R Plots

  • Use col parameter to control color in functions like plot(), hist(), barplot(), pie(), etc.

  • Apart from setting col="<name>", you can specify red, green and blue (RGB) components of the color you want. The intensity of RGB components can be given either in decimal, or hexadecimal system. For decimal representation, you can use rgb() function:

rgb(red, green, blue, alpha, names = NULL, maxColorValue = 1)
  • alpha (a.k.a alpha channel) measures opacity, from 0 to maxColorValue.

  • eg: specifying red=1,green=0,blue=1, withmaxColorValue=1`, gives us magenta (purple) color.

hist(mtcars$mpg, breaks=12, col=rgb(1,0,1))

hist(mtcars$mpg, breaks=12, col=rgb(1,0,1, alpha=0.3))

hist(mtcars$mpg, breaks=12, col=rgb(85,0,85, alpha=180.9, maxColorValue=255))

The two most common representations of RGB values are real numbers from [0,1], or integers from 0 - 255 (so, \(256 = 2^8\) integers in total, which means we use 1 byte of memory for intensity for each of the three RGB colors).

You can also express intensity of RGB colors in hexadecimal system. Recall that the digits in hexadecimal system are \(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F\). The notation for color in hexadecimal system starts with #, followed by two hexadecimal digits for Red, followed by two hexadecimal digits for Green, and followed by two hexadecimal digits for Blue.

For example, maroon: RGB = (140, 29, 64), where each intensity is out of 255. Since

\((140)_{10} = (1\cdot 10^2 + 4\cdot 10^1 + 0\cdot 10^0)_{10} = (8\cdot 16^1 + \underbrace{12}_{=(\text{C})_{16}}\cdot 16^0)_{10} = (8\text{C})_{16}\)

\((29)_{10} = (2\cdot 10^1 + 9\cdot 10^0)_{10} = (1\cdot 16^1 + \underbrace{13}_{=(D)_{16}}\cdot 16^0)_{10} = (1\text{D})_{16}\)

\((64)_{10} = (4\cdot 16^1 + 0\cdot 16^0)_{10} = (40)_{16}\)

we can represent maroon as “#8C1D40”. Similarly, gold = (255, 198, 39) = “#FFC627

## defining a vector of magenta, maroon, gold and cyan
mycolors = c("#FF00FF","#8C1D40", "#FFC627", "#00FFFF")
png("four-hists-w-col.png", width=1000, height=500) #open a new png file
t=(-400:400)/100; set.seed(12)
par(mfrow=c(2,2))   
iter = 0
for(n in c(100, 500, 1000, 5000)){
    iter = iter + 1 ##number of iterations by now
    x = rnorm(n=n, mean=0, sd=1)
    hist(x, breaks=50, freq=F,main=paste("n=",n),
         col=mycolors[iter], xlim=c(-4,4), ylim=c(0,0.8))
    lines(t,dnorm(t,0,1),lwd=2,col="blue")
} #end of the for-loop
dev.off()  #close the current (i.e. png) file 

the rest are “scratch” slides

## defining a vector of magenta, maroon, gold and cyan
mycolors = c("#FF00FF","#8C1D40", "#FFC627", "#00FFFF")
t=(-400:400)/100
par(mfrow=c(2,2))
iter = 0
for(iter in 1:4){
  n = 10^iter
  temp=rep(0,1000)
  for(i in 1:1000){
  temp[i]=(mean(rbinom(n,10,0.1))-1)/sqrt(0.9/n)
  }
  hist(temp,40,freq=FALSE,main=paste("n=",n),col=mycolors[iter],xlim=c(-4,4))
  lines(t,dnorm(t,0,1),lwd=2,col="blue")
} #end of the outer for-loop

The most important function for base plots is the plot() function, which we already used. Let’s generate density functions of chi-square distributions with degrees of freedom df = 1, 2, 3 and 4. We start with df = 1.

x = seq(-10,10,by=0.1)
y = sin(x)
plot(x,y, type="p", col="blue")

t=(1:400)/100
y1 = dchisq(t, df=1)
plot(t, y1, type="l",lwd=2,col="black",ylim=c(0,1.5),ylab="density")

Note that we used function dchisq() to generate \(y\) coordinates of the density of \(\chi^2\) distribution. Also, type="l" means the type of the plot is a line-plot, i.e. solid line. The various types are:

t=(1:400)/100
y1 = dchisq(t,df=1)
plot(t, y1, type="l",lwd=2,col="black",ylim=c(0,1.5),ylab="density")
## now, adding new plots
y2 = dchisq(t,df=2)
y3 = dchisq(t,df=3)
y4 = dchisq(t,df=4)
lines(t, y2, lwd=2,col="red")
lines(t, y3, lwd=2,col="blue")
lines(t, y4, lwd=2,col="green")

## and we add a legend, too

legend(x=3,y=1,legend=c("df=1", "df=2", "df=3", "df=4"),
       col=c("black", "red", "blue", "green"), lty="solid", cex=0.8)

Note that x=3, y=1 represent the position of the upper left point of the legend. We also used lty="solid" for solid line type (which is default). Instead of the word "solid", we could use lty=2. You can google which numbers represent which line types. Also, cex for the expansion (i.e. size) of the object. Play with different values to see how it affects the plot.

To add a straight line to a plot, you can use lines(). However, this requires creating vectors x and y. Instead, you one can use abline() for linear function \(y=ax+b\). The function abline() has input parameters a and b (the slope and the intercept). The abline() function can be used for vertical and horizontal lines, as well (with parameters v and h, respectively)

abline(a=0.5, b=2, lwd=4, lty='dashed')

t=(-400:400)/100
par(mfrow=c(2,2))
for(n in c(10, 100, 1000, 10000)){
  temp=rep(0,1000)
  for(i in 1:1000){
      temp[i]=(mean(rbinom(n,10,0.1))-1)/sqrt(0.9/n)
      }
      hist(temp,40,freq=F,main=paste("n=",n),xlim=c(-4,4))
      lines(t,dnorm(t,0,1),lwd=2,col="blue")
}#end of the outer for-loop

t=(-400:400)/100
par(mfrow=c(2,2))
for(n in c(10, 100, 1000, 10000)){
  temp=rep(0,1000)
  for(i in 1:1000){
      temp[i]=(mean(rbinom(n,10,0.1))-1)/sqrt(0.9/n)
      }
      hist(temp,40,freq=F,main=paste("n=",n),xlim=c(-4,4))
      lines(t,dnorm(t,0,1),lwd=2,col="blue")
}#end of the outer for-loop